Encodings of Cladograms and Labeled Trees
Abstract
This paper deals with several bijections between cladograms and perfect matchings. The first of these is due to Diaconis and Holmes. The second is a modification of the Diaconis-Holmes matching which makes deletion of the largest labeled leaf correspond to gluing together the last two points in the perfect matching. The third is an entirely new encoding of cladograms, first as a bijection with a certain set of strings and then, via this, to perfect matchings. In this pair of bijections, deletion of the largest labeled leaf corresponds to deletion of the corresponding symbols from the string or deletion of the corresponding pair from the matching. These two new bijections are related through a common max-min labeling of internal vertices with two different choices for the label of the root node. All these encodings are extended to cladograms with edge lengths and left-right ordered children. Moving a single symbol in this last encoding corresponds to a subtree prune and regraft operation on the cladogram, making it well suited for use in phylogenetics software. Finally, a perfect Gray code for cladograms is derived from the bar encoding, along with a total ordering on all cladograms. Algorithms are also provided for finding the next and previous cladogram, the cladogram at any position, and the position of any cladogram in the sequence.
A cladogram with n leaves is a rooted binary leaf-labeled tree with leaves distinctly labeled 1, ..., n. It has long been known that the number of such trees with exactly n leaves is (2n − 3)!!. This is also the number of perfect matchings on 2(n − 1) points. Diaconis and Holmes give a bijection in [7] between the set of cladograms and perfect matchings.
∗Research supported by Stanford Mathematics Department and NSF grant #0241246. The Electronic Journal of Combinatorics 17 (2010), #R54.
Currently, cladograms are most often encoded in variants of the Newick or New Hampshire format.
This is an enrichment of parenthesis notation which allows additional information such as edge lengths to be included. However, a major drawback of Newick notation is that there is in general no unique representation for a cladogram. For example, testing equality of large cladograms given in Newick format is a non-trivial task. For this reason, a bijection is preferable.
One such bijection is that of Diaconis-Holmes. This is used in the R package APE (Analyses of Phylogenetics and Evolution [14]) because it provides a unique and compact representation of a cladogram, and in a fast-mixing random walk on cladograms [6]. While simple and elegant, this bijection can be improved upon. A desirable property which the Diaconis-Holmes bijection lacks is deletion-stability. There is a natural projection from the set of cladograms with n leaves to the set with n − 1 leaves: deletion of the n-th leaf. For the Diaconis-Holmes bijection the induced map on perfect matchings is not natural.
A second direct bijection between cladograms and perfect matchings is presented here, called the hat encoding. This is an alteration of the Diaconis-Holmes bijection which makes deletion of the leaf labeled n correspond to gluing together the last two points in the matching. Algorithms are provided for finding the matching corresponding to a cladogram and the cladogram corresponding to a matching.
A completely new encoding of cladograms is also presented, called the bar encoding. This coding is a bijection between cladograms with n leaves and a subset of permutations of the set {2, 2̄, 3, 3̄, ..., n, n̄}. This string of symbols is called the name of a cladogram. Deletion of the leaf labeled n corresponds to deletion of the symbols n and n̄ from the name. The set of names is in natural bijection with the set of matchings on 2n − 2 points.
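The two counts quoted in this section, (2n − 3)!! cladograms with n leaves and the same number of perfect matchings on 2(n − 1) points, can be checked by brute force for small n. The following Python sketch is an illustration only and is not taken from the paper; the function names double_factorial_odd and count_perfect_matchings are hypothetical.

```python
from math import prod


def double_factorial_odd(n_leaves: int) -> int:
    """Return (2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3); the empty product is 1."""
    return prod(range(1, 2 * n_leaves - 2, 2))


def count_perfect_matchings(points: int) -> int:
    """Count perfect matchings on `points` labeled points by direct recursion."""

    def count(remaining: tuple) -> int:
        if not remaining:
            return 1  # nothing left to pair: exactly one (empty) matching
        rest = remaining[1:]
        # Pair the first point with each other point, then match what is left.
        return sum(count(rest[:i] + rest[i + 1:]) for i in range(len(rest)))

    return count(tuple(range(points)))


# The two quantities agree for every small n tried here.
for n in range(2, 7):
    assert double_factorial_odd(n) == count_perfect_matchings(2 * (n - 1))
```

The recursion is exponential and is meant only as a sanity check; it says nothing about how any particular bijection pairs trees with matchings.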
For a cladogram with n leaves, deletion of the leaf labeled n corresponds to removing the last pair in the matching (pairs are labeled by starting at the last point in the set and moving to the first, labeling pairs n to 2 in the order they are first encountered).
The hat and bar encodings both involve labeling the internal vertices of a tree. Both of these labelings may be easily described in terms of max-min labeling, covered in Section 4. Which of the two labelings is generated depends on the choice of label for the root vertex.
The bar encoding is also used to give a perfect Gray code on the set of cladograms with n leaves. In this case, the Gray code is a sequential ordering of the set of cladograms in which adjacent cladograms differ by a small amount, specifically a subtree prune and regraft operation. Algorithms are provided to find the name of the next and previous cladogram in the Gray code. Algorithms are also provided which return the position of a cladogram in the Gray code given its name, and the name of the cladogram in a given position. Such functions are sometimes called ranking and unranking functions; those for the set of permutations given by Myrvold and Ruskey [13] are an example. The Combinatorial Object Server [16] uses such functions to provide indexed lists for many types of objects but does not yet serve cladograms.
The necessary basic definitions are now reviewed. Recall that a tree is a simple graph of vertices and edges with precisely one non-self-intersecting path between any two vertices. A cladogram with n leaves is a finite rooted binary tree with non-root leaves distinctly labeled 1, 2, ..., n. Note that the planar representation of the cladogram is not important, i.e. ‘left’ and ‘right’ children are not distinguished. A fat cladogram, or oriented cladogram, is a cladogram where the children of each vertex are distinguished as the ‘left’ child and the ‘right’ child.
In other words, the edges around each vertex have a cyclic ordering.
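As an illustration of the definitions above, an unordered cladogram can be modeled so that equal cladograms compare equal regardless of how they were written down. The sketch below assumes a representation that is not the paper's: a leaf is an integer label, an internal vertex is a frozenset of its two children, and the root edge is left implicit. The helper names canon, delete_leaf, insert_leaf and all_cladograms are hypothetical. It shows that two different parenthesizations of the same tree agree, that deleting leaf n yields a cladogram with n − 1 leaves, and that growing trees by leaf insertion reproduces the count (2n − 3)!!.

```python
def canon(tree):
    """Convert nested pairs such as ((1, 2), 3) into an order-free canonical form."""
    if isinstance(tree, int):
        return tree  # a leaf is just its label
    left, right = tree
    return frozenset((canon(left), canon(right)))


def delete_leaf(tree, label):
    """Delete leaf `label` from a canonical tree with at least two leaves.

    The parent of the deleted leaf is suppressed, so the result is binary again.
    """
    if isinstance(tree, int):
        return tree
    a, b = tuple(tree)
    if a == label:
        return b
    if b == label:
        return a
    return frozenset((delete_leaf(a, label), delete_leaf(b, label)))


def insert_leaf(tree, label):
    """Yield every tree obtained by attaching `label` on an edge (root edge included)."""
    yield frozenset((tree, label))  # attach on the edge above this vertex
    if not isinstance(tree, int):
        a, b = tuple(tree)
        for new_a in insert_leaf(a, label):
            yield frozenset((new_a, b))
        for new_b in insert_leaf(b, label):
            yield frozenset((a, new_b))


def all_cladograms(n):
    """Return the set of all canonical cladograms with leaves 1..n."""
    trees = {1}
    for label in range(2, n + 1):
        trees = {t for old in trees for t in insert_leaf(old, label)}
    return trees


# Two parenthesizations of the same unordered tree are equal after canonicalization.
assert canon(((1, 2), 3)) == canon((3, (2, 1)))
assert canon(((1, 2), 3)) != canon(((1, 3), 2))
# Counts match (2n - 3)!! for n = 2, 3, 4, 5.
assert [len(all_cladograms(n)) for n in range(2, 6)] == [1, 3, 15, 105]
# Deleting leaf 4 maps the 4-leaf cladograms onto the 3-leaf cladograms.
assert {delete_leaf(t, 4) for t in all_cladograms(4)} == all_cladograms(3)
```

In this model a cladogram with n − 1 leaves has 2n − 3 vertices, each with one edge above it, so leaf n can be attached in 2n − 3 ways; this is one way to see the product formula for the count. A fat cladogram would need ordered pairs (tuples) in place of frozensets.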
Similar resources
Probabilities on cladograms: introduction to the alpha model
The alpha model, a parametrized family of probabilities on cladograms (rooted binary leaf labeled trees), is introduced. This model is Markovian self-similar, deletion-stable (sampling consistent), and passes through the Yule, Uniform and Comb models. An explicit formula is given to calculate the probability of any cladogram or tree shape under the alpha model. Sackin's and Colless' index are s...
On Consensus, Collapsibility, and Clade Concordance
Consensus in cladistics is reviewed. Consensus trees, which summarize the agreement in grouping among a set of cladograms, are distinguished from compromise trees, which may contain groups that do not appear in all the cladograms being compared. Only a strict or Nelson tree is an actual consensus. This distinction has implications for the concept of support for cladograms: only those branches s...
The probabilities of trees and cladograms under Ford's $\alpha$-model
We give correct explicit formulas for the probabilities of rooted binary trees and cladograms under Ford’s α-model.
Patching Up X-trees
A fundamental problem in many areas of classification, and particularly in biology, is the reconstruction of a leaf-labeled tree from just a subset of its induced subtrees. Without loss of generality, we may assume that these induced subtrees all have precisely four leaves. Of particular interest is the question of determining whether a collection of quartet subtrees uniquely defines a parent t...
PACT: an efficient and powerful algorithm for generating area cladograms
Formal methods of historical biogeographical analysis using phylogenetic trees began appearing more than 25 years ago (Platnick & Nelson, 1978). Today, two classes of methods for documenting historical biogeographical patterns exist. All begin by converting phylogenetic trees into taxon–area cladograms (Morrone & Carpenter, 1994; Enghoff, 1996), Department of Zoology, University of Toronto, Tor...
Journal title:
- Electr. J. Comb.
Volume 17
Pages: -
Publication date: 2010